Rawhed's Tutorial #5: MMX
Introduction
When Intel released MMX I thought that it sucked! I thought this until I tried it out a few months ago, and it's actually VERY cool. I think the reason why no one took it seriously was Intel's marketing. I mean look at the P3 - it's supposed to enhance your Internet experience. "What crap!" everyone thinks. Hehe. The P3 has very cool SSE instructions, which are basically Intel's reply to AMD's 3DNow! technology - but more about that in another tutorial.
I've seen a few demos starting to use MMX, which is very cool. My current demo engine checks if the machine has MMX/3DNow! and if so - uses the appropriate functions optimised using those instructions. I think I've started off badly with all my rambling, so let me give you a bit of background about MMX.
Basically MMX is a set of instructions for the Pentium range of machines and is Intel's first big change to their x86 instruction set since the 386 (1985). There are 57 new instructions in all. The instructions are very good for multimedia type processing - things like audio, video, imagery etc. MMX also comes with 8 new 64 bit registers (well sort of) and uses SIMD (Single Instruction Multiple Data), which basically means that one instruction can operate on several data elements in parallel. No one is totally sure what MMX stands for because Intel has never said, however most people seem to agree it's MultiMedia eXtensions.
Most machines these days support MMX. I'm not sure if modern-day compilers optimise using MMX, or if big commercial programs use MMX much - but they should! It's high time everyone accepts MMX as standard. Even Cyrix and AMD machines support it. I hope to see more MMX demo stuff too. :) Maybe it will help demos have more particles/polygons/whatever than ever before.
Anyways, enough gibberish - on with the tut.
MMX Registers
There are 8 new MMX registers. They are called MM0, MM1, MM2...MM7. They are not really new, because physically they are not separate registers on the chip - instead they share the floating point registers. As you know the registers in the FP unit are 80 bits wide. When an MMX instruction writes to a register, the sign bit (bit 79) and the exponent part (bits 64-78) are filled with 1's, and the remaining 64 bits of the FP register (the mantissa) are where the MMX register lies.
So basically the MMX registers are aliased onto the floating point registers. What this means is that while you are using MMX you can't use FP instructions. You have to call EMMS when you have finished a block of MMX code, before any FP code runs.
MMX instructions work with the 64 bit registers in various ways - depending on the instruction. There are 4 new ways the instructions can look at the 64 bit data:
Packed Bytes - 64 bits divided into 8 bytes.
Packed Words - 64 bits divided into 4 words.
Packed Dwords - 64 bits divided into 2 dwords.
Quadword - 64 bits undivided.
So if you are using an instruction that works on packed bytes, it will perform 8 operations - one on each byte. Each byte will be treated as an independent entity and will not touch any of the other bytes. The same goes for the packed words and dwords.
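To make the four views of a 64 bit value concrete, here is a small C sketch that pulls individual lanes out of a 64 bit integer. The function names are made up for this illustration, and it assumes lane 0 is the least significant element, which matches how the registers are laid out on x86.

```c
#include <stdint.h>

/* Extract lane i from a 64-bit value; lane 0 is the least significant. */
/* These helper names are hypothetical, not part of any MMX API.        */
static uint8_t  lane_byte (uint64_t q, int i) { return (uint8_t) (q >> (8  * i)); }
static uint16_t lane_word (uint64_t q, int i) { return (uint16_t)(q >> (16 * i)); }
static uint32_t lane_dword(uint64_t q, int i) { return (uint32_t)(q >> (32 * i)); }
```

The same 64 bits give 8 byte lanes, 4 word lanes or 2 dword lanes; which view applies depends only on the instruction suffix, not on anything stored in the register.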
MMX Instructions
MMX instructions are all pretty much formatted the same way:
instruction dest,source
The mmx instructions often have suffixes which describe
a) signed/unsigned operation
b) saturated/wraparound operation
c) whether the instruction works on packed bytes,words,dwords or qword.
For example the instruction padd can be used as:
paddusw MM2,mem1 (add unsigned, saturated operation using packed words)
paddb MM2,mem1 (add using wraparound, on packed bytes)
Some mmx instructions only work on certain datatypes, so I've indicated this when I describe the instruction.
Saturated vs Wraparound
You will see that some of the instructions support something called "saturation". This is a very cool new thing in mmx that stops wraparounds(overflows) from happening when you exceed the datarange limits. For example:
mov al,250
add al,10 ;al is now equal to 4 (260-256). This is wraparound/overflow.
mov eax,250
mov ebx,10
movd MM0,eax
movd MM1,ebx
paddusb MM0,MM1
movd eax,MM0 ;eax = 255. This is saturation.
Similarly if we'd been dealing with unsigned 16 bit values then they would saturate at 65535 and zero. If it's signed saturation (e.g. paddsb), then the clipping values will be the signed limits of that datatype, e.g. for bytes: -128..127. In the above example of saturation, the paddusb is actually doing 8 additions and saturations, all at the same time!
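Here is a rough C model of a single byte lane under the three behaviours: wraparound, unsigned saturation and signed saturation. It is only a scalar sketch to show the clamping rules; the function names are hypothetical and the real instructions do this for every lane at once.

```c
#include <stdint.h>

/* Wraparound add: the result is reduced modulo 256, like ADD AL,x or one lane of PADDB. */
static uint8_t add_wrap_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}

/* Unsigned saturated add, modelling one lane of PADDUSB: clamp at 255. */
static uint8_t add_sat_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > 255 ? 255 : sum);
}

/* Signed saturated add, modelling one lane of PADDSB: clamp to -128..127. */
static int8_t add_sat_s8(int8_t a, int8_t b) {
    int sum = a + b;
    if (sum > 127)  sum = 127;
    if (sum < -128) sum = -128;
    return (int8_t)sum;
}
```

Note that the byte value 250 is -6 when read as a signed byte, which is why the signed and unsigned saturating adds give different answers for the same bit patterns.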
MMX Instructions - EMMS
Since the MMX registers use the same space as the FPU registers, the two can't be used at the same time. EMMS must always be called after a block of MMX code; otherwise, when FP code is executed after it, FP stack overflows and wrong answers will arise, because the FPU still sees its registers as full of residual MMX data. The only problem is that EMMS can be very slow (around 50 cycles on some processors). AMD addressed this with their FEMMS instruction, which does much the same thing, except it's a lot faster (around 5 cycles).
MMX Instructions - Moving 32/64 bits
These instructions are very important because they are how you move data into/around/from the MMX registers. There are 2 MMX data moving instructions, MOVD and MOVQ. As I'm sure you've already worked out, MOVD moves 32 bits of data and MOVQ moves 64 bits. Here are their operands:
MOVD dest,src
MOVD MMXreg,x86reg/Mem
MOVD x86reg/Mem,MMXreg
MOVQ MMXreg/Mem,MMXreg/Mem
However you cannot have both the destination and source operands as memory addresses! This is a major bummer because otherwise nice fast 64 bit memory copies could be done.
MOVD can be used to load data into an MMX register from the normal x86 registers. So if you want to move the 32 bit value of EAX into MM3 you would use MOVD MM3,eax. This fills the lower 32 bits of MM3 with the value of eax and fills the upper 32 bits with zeros. This is what is often used to get data into the MMX registers for them to play with. After your MMX routine is done and you want to get the result you can use MOVD eax,MM3, which copies the lower 32 bits of MM3 into eax. Note that MOVD has no MMX-to-MMX form; to copy one MMX register to another, use MOVQ.
MOVQ can't access the normal 32 bit registers, so you have to use MOVD to load/unload x86 register data. MOVQ is used to load/unload 64 bits to/from memory, or to copy one MMX register to another. For example if you have an ARGB memory buffer you can use MOVQ MM0,mem1 to load 2 pixels (64 bits) into register MM0. To put the data back once it's been through your MMX routine just use something like MOVQ mem2,MM0.
MMX Instructions - Addition & Subtraction
PADD and PSUB are the base mmx addition and subtraction instructions. Applying suffixes to them allows you to specify whether you want a signed/unsigned and wraparound/saturated instruction. These instructions can accept MMX registers or memory addresses as source operands, but only MMX registers as destination operands:
PADDx dest,src
PADDx MMXreg,MMXreg/Mem
PSUBx MMXreg,MMXreg/Mem
Here are the add/sub instructions and on what datatypes they work:
PADD (packed wraparound add) - byte - word - dword
PADDS (packed signed saturated add) - byte - word
PADDUS (packed unsigned saturated add) - byte - word
PSUB (packed wraparound sub) - byte - word - dword
PSUBS (packed signed saturated sub) - byte - word
PSUBUS (packed unsigned saturated sub) - byte - word
Now that you know the datatypes that they work on you can just add the matching suffix (b, w or d) to the instructions, e.g.:
PADDB, PADDW, PADDD - each one for a different data type
PADDSB, PADDSW - each one for a different data type
PADDUSB, PADDUSW - each one for a different data type
So PADDB works on packed bytes and PADDW works on packed words - but how? This is the beauty of MMX - it does things in parallel. A PADDB will do 8 additions. Here is how:
MM0 - |008|000|005|000|255|000|001|045| 8 bytes (64 bits)
MM1 - |000|057|005|000|005|000|001|002| 8 bytes (64 bits)
PADDB MM0,MM1
result (mm1 unchanged):
MM0 - |008|057|010|000|004|000|002|047| 8 bytes (64 bits)
This will add each 8 bit entity and put the resulting 8 bytes into the destination operand (MM0). PADDB is a wraparound instruction of course, so 255+5=4.
If we were using the PADDUSB instruction it would have worked the same, except for the 255+5, which saturates at 255 (PADDSB would treat 255 as -1 and give 4 again):
MM0 - |008|000|005|000|255|000|001|045| 8 bytes (64 bits)
MM1 - |000|057|005|000|005|000|001|002| 8 bytes (64 bits)
PADDUSB MM0,MM1
result (mm1 unchanged):
MM0 - |008|057|010|000|255|000|002|047| 8 bytes (64 bits)
One more example now, this time using packed words and unsigned saturated subtraction:
MM0 - |001234|000010|000005|008516| 4 words (64 bits)
MM1 - |000001|000020|000001|009343| 4 words (64 bits)
PSUBUSW MM0,MM1
result (mm1 unchanged):
MM0 - |001233|000000|000004|000000| 4 words (64 bits)
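If you want to check examples like these by hand, a scalar C model is handy. The sketch below emulates PADDB, PADDUSB and PSUBUSW on a plain 64 bit integer, one lane at a time. The function names simply mirror the mnemonics and are my own; this is a model of the semantics, not how you would actually call MMX from C.

```c
#include <stdint.h>

/* PADDB model: add 8 byte lanes independently, each wrapping modulo 256. */
static uint64_t paddb(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t x = (uint8_t)(a >> (8 * i));
        uint8_t y = (uint8_t)(b >> (8 * i));
        r |= (uint64_t)(uint8_t)(x + y) << (8 * i);
    }
    return r;
}

/* PADDUSB model: add 8 byte lanes, clamping each sum at 255. */
static uint64_t paddusb(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        unsigned sum = (unsigned)((a >> (8 * i)) & 0xFF) + (unsigned)((b >> (8 * i)) & 0xFF);
        if (sum > 255) sum = 255;
        r |= (uint64_t)sum << (8 * i);
    }
    return r;
}

/* PSUBUSW model: subtract 4 word lanes, clamping each difference at zero. */
static uint64_t psubusw(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t x = (uint16_t)(a >> (16 * i));
        uint16_t y = (uint16_t)(b >> (16 * i));
        uint16_t d = (x > y) ? (uint16_t)(x - y) : 0;
        r |= (uint64_t)d << (16 * i);
    }
    return r;
}
```

Feeding it the byte and word values from the examples (written out in hex, leftmost element most significant) is a quick way to confirm which lanes wrap and which saturate.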
MMX Instructions - Shifting
These instructions are very similar to the old x86 SHL and SHR instructions, only they are very cool because they work on the different packed formats, so they can shift multiple values in one instruction. Here are the base shifting instructions and what datatypes they act on:
PSLL (Packed Shift Left Logical) - word - dword - qword
PSRA (Packed Shift Right Arithmetic) - word - dword
PSRL (Packed Shift Right Logical) - word - dword - qword
So once again (just like the padd & psub instructions) just add the suffixes to the base instruction name to get the instruction name that works on a certain datatype. E.g.:
PSLLW - does a left logical shift on the packed word datatype
PSRAD - does a right arithmetic shift on the packed dword datatype
These shifting instructions are all formatted the same way as the SHL and SHR instructions:
instruction dest, shiftamount
PSLLW MMXreg, MMXreg/Mem/Immed
e.g.: PSLLW MM1, 3
The PSLLx and PSRLx instructions are all basically the same. They shift the bits to the left/right and fill the low/high order bits with zeros. Here is an example of the PSRLW instruction:
MM4 (64 bits, 4words):
|0000100001001100|0000000000011111|0000011000001100|1111110000000000|
PSRLW MM4,5 (packed logical shift to the right by 5)
MM4 (64 bits, 4words):
|0000000001000010|0000000000000000|0000000000110000|0000011111100000|
As you can see from this example, zeros fill the high order bits and the low order bits that are shifted out to the right are lost. The PSLLx instructions work just like this except that they shift to the left and the low order bits are filled with zeros.
Both of those sets of instructions are called "logical" while the PSRAx instructions are "arithmetic". This is basically calling them unsigned and signed instructions. The arithmetic instruction takes into account whether the data is positive/negative. PSRAx shifts data to the right. If the data element is positive then it fills the high order bits of the destination with zeros. If the data element is negative then it fills the high order bits of the destination with ones. Remember that a data element is negative if its highest bit is set (1). Here is an example of how it works:
MM4 (64 bits, 4words):
|0000100001001100|0000000000011111|1000011000001100|1111110000000000|
PSRAW MM4,5 (packed arithmetic shift to the right by 5)
MM4 (64 bits, 4words):
|0000000001000010|0000000000000000|1111110000110000|1111111111100000|
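The difference between the logical and arithmetic right shifts is easy to model in C. This is a sketch with made-up function names that mirror the mnemonics; it treats the 64 bit value as 4 word lanes and assumes a shift count given as a plain integer (like the immediate form).

```c
#include <stdint.h>

/* PSRLW model: logical right shift of each 16-bit lane; zeros enter from the top. */
static uint64_t psrlw(uint64_t v, unsigned n) {
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t w = (uint16_t)(v >> (16 * i));
        uint16_t s = (n > 15) ? 0 : (uint16_t)(w >> n);   /* counts above 15 clear the lane */
        r |= (uint64_t)s << (16 * i);
    }
    return r;
}

/* PSRAW model: arithmetic right shift; copies of the sign bit enter from the top. */
static uint64_t psraw(uint64_t v, unsigned n) {
    uint64_t r = 0;
    if (n > 15) n = 15;   /* larger counts behave like 15: the lane becomes all sign bits */
    for (int i = 0; i < 4; i++) {
        uint16_t w = (uint16_t)(v >> (16 * i));
        uint16_t s = (uint16_t)(w >> n);
        if (w & 0x8000) s |= (uint16_t)~(0xFFFFu >> n);   /* fill the vacated bits with 1s */
        r |= (uint64_t)s << (16 * i);
    }
    return r;
}
```

For positive words the two functions agree; they only differ when the top bit of a word is set.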
It's a GREAT pity that the shifting instructions don't work on the packed byte datatype otherwise we could shift 8 bytes at a time and if using ARGB data this would be invaluable! Oh well.. We can get around this by doing the hack mentioned later on in this tut, in the section called "some implementation ideas".
MMX Instructions - Logical Instructions
These are the MMX equivalents of the bitwise instructions like AND, OR and XOR. They only work on the full 64 bits (qword), so the instruction is formatted:
instruction dest,src
instruction MMXreg,MMXreg/Mem
There are 4 MMX bitwise instructions: pand, pandn, por and pxor.
PAND works just like normal ANDing except that it's being applied to 64 bits. As a refresher:
0 AND 0 = 0
1 AND 1 = 1
1 AND 0 = 0
0 AND 1 = 0
PANDN (AND NOT) first inverts the bits of the destination then applies the logical AND with the source. The inverted destination bit is shown in brackets:
0(1) ANDN 0 = 0
1(0) ANDN 1 = 0
1(0) ANDN 0 = 0
0(1) ANDN 1 = 1
POR:
0 OR 0 = 0
1 OR 1 = 1
1 OR 0 = 1
0 OR 1 = 1
PXOR (exclusive OR):
0 XOR 0 = 0
1 XOR 1 = 0
1 XOR 0 = 1
0 XOR 1 = 1
I often PXOR MM7,MM7 to make my MM7 register==0. This is very useful when doing packing/unpacking as you will see later.
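In C terms the four bitwise instructions are one-liners. The thing to watch is the operand order of PANDN: it is the destination that gets inverted. The sketch below uses my own function names, plus a hypothetical select_bits helper showing the common trick of combining PAND, PANDN and POR to pick bits from two inputs according to a mask.

```c
#include <stdint.h>

static uint64_t pand (uint64_t dest, uint64_t src) { return dest & src; }
static uint64_t pandn(uint64_t dest, uint64_t src) { return ~dest & src; }  /* dest is the one inverted */
static uint64_t por  (uint64_t dest, uint64_t src) { return dest | src; }
static uint64_t pxor (uint64_t dest, uint64_t src) { return dest ^ src; }

/* Per-bit select: take bits from a where mask is 1 and from b where mask is 0. */
static uint64_t select_bits(uint64_t mask, uint64_t a, uint64_t b) {
    return por(pand(mask, a), pandn(mask, b));
}
```

XORing a value with itself always gives zero, which is why PXOR reg,reg is the usual way to clear an MMX register.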
MMX Instructions - Multiply
The 3 MMX multiplication instructions all operate on signed 16 bit data and produce 32 bit results internally; they differ in which part of those results they keep. The 3 instructions are:
PMADD (Packed Multiply Add) - word-->dword
PMULH (Packed Multiply High) - word
PMULL (Packed Multiply Low) - word
All 3 work as:
instruction dest,src
instruction MMXreg,MMXreg/Mem
PMADDWD multiplies each of the 4 signed words in the source operand with the corresponding word in the destination operand - producing 4 dwords. The lower two dwords are added together and stored as 1 dword in the lower 32 bits of the destination register. The same is done for the higher 2 dwords, except that they are stored in the highest 32 bits of the destination register. You can see why the suffix of the instruction is "WD", because it takes input of words but the output is in packed dwords. This instruction could be very useful for a variety of things. Complex number multiplication can benefit from this instruction immensely as it requires 4 multiplications and two additions. Also imagine how easily it could do 2 lots of (x*x)+(y*y) in parallel!
PMULHW multiplies each of the 4 signed words in the source operand with the corresponding word in the destination operand. This again produces four 32 bit numbers, so it discards the lower 16 bits of each result and stores the higher 16 bits in the corresponding word of the destination operand.
PMULLW does the same as PMULHW except it discards the higher 16 bits and stores the lower 16 bits of the multiplication result.
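To see exactly which bits each multiply keeps, here is a scalar C sketch of the three instructions. The names mirror the mnemonics but are my own, and sword is a hypothetical helper that reads one lane as a signed 16 bit value. Products are computed in a wider type so nothing overflows on the C side.

```c
#include <stdint.h>

/* Read word lane i (0 = least significant) as a signed 16-bit value. */
static int32_t sword(uint64_t q, int i) {
    int32_t u = (int32_t)((q >> (16 * i)) & 0xFFFF);
    return (u & 0x8000) ? u - 0x10000 : u;
}

/* PMULLW model: keep the low 16 bits of each 32-bit product. */
static uint64_t pmullw(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t p = (uint32_t)(sword(a, i) * sword(b, i));
        r |= (uint64_t)(p & 0xFFFF) << (16 * i);
    }
    return r;
}

/* PMULHW model: keep the high 16 bits of each 32-bit product. */
static uint64_t pmulhw(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t p = (uint32_t)(sword(a, i) * sword(b, i));
        r |= (uint64_t)((p >> 16) & 0xFFFF) << (16 * i);
    }
    return r;
}

/* PMADDWD model: multiply word pairs, then add neighbouring products into 2 dwords. */
static uint64_t pmaddwd(uint64_t a, uint64_t b) {
    int64_t lo = (int64_t)sword(a, 0) * sword(b, 0) + (int64_t)sword(a, 1) * sword(b, 1);
    int64_t hi = (int64_t)sword(a, 2) * sword(b, 2) + (int64_t)sword(a, 3) * sword(b, 3);
    return ((uint64_t)(uint32_t)hi << 32) | (uint32_t)lo;
}
```

Using PMULLW and PMULHW on the same inputs gives you the two halves of the full 32 bit products, which you can stitch together with the unpack instructions if you need them.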
MMX Instructions - Comparing
Yes, there are even some new additions to the CMP family! :) They are quite weird, let me introduce them:
PCMPEQ (Packed Compare for Equality) - byte - word - dword
PCMPGT (Packed Compare for Greater Than) - byte - word - dword
From that you can work out all the actual instructions:
PCMPEQB, PCMPEQW, PCMPEQD
PCMPGTB, PCMPGTW, PCMPGTD
PCMPEQx compares the data elements (whatever their size) in the source operand to those in the destination operand. If they are equal, 1's are written to that part of the destination operand, if not then 0's are written. So you end up with a destination operand comprised of zero and FF(PCMPEQB)/FFFF(PCMPEQW)/FFFFFFFF(PCMPEQD) data elements.
PCMPGTx does the same as PCMPEQx, except that if the data in the destination data element is greater than the data in the source data element, 1's are written to the destination data element, otherwise 0's are written. Note that PCMPGTx is a signed compare, so a byte like 0x80 counts as -128, not 128.
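The compare instructions don't set flags; they build a mask. A scalar C model of the byte versions might look like this. The function names mirror the mnemonics and sbyte is a hypothetical helper; the signed interpretation in the greater-than compare is the part that usually catches people out.

```c
#include <stdint.h>

/* Read byte lane i (0 = least significant) as a signed value in -128..127. */
static int sbyte(uint64_t q, int i) {
    int v = (int)((q >> (8 * i)) & 0xFF);
    return v > 127 ? v - 256 : v;
}

/* PCMPEQB model: 0xFF in every lane where dest == src, 0x00 elsewhere. */
static uint64_t pcmpeqb(uint64_t dest, uint64_t src) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        if (((dest >> (8 * i)) & 0xFF) == ((src >> (8 * i)) & 0xFF))
            r |= (uint64_t)0xFF << (8 * i);
    return r;
}

/* PCMPGTB model: 0xFF in every lane where dest > src as signed bytes. */
static uint64_t pcmpgtb(uint64_t dest, uint64_t src) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        if (sbyte(dest, i) > sbyte(src, i))
            r |= (uint64_t)0xFF << (8 * i);
    return r;
}
```

The resulting mask is typically fed straight into PAND/PANDN/POR to keep or discard whole elements without any branching.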
MMX Instructions - Packing/Unpacking
I've left a very important set of instructions till last. I'm not sure why.. perhaps it's all about saving the best till last. These aren't a magical set of instructions which will make your coding amazingly fast. They are very important because they allow you to control the format of data going into the MMX registers so that you can use its parallelism. They are also pretty cool because some of them perform saturation too. Often your data won't be in the nice format needed for parallel number crunching, and these instructions can convert it into this format. You then use your cool MMX function on it and pack the numbers back into their original format.
PACKSS (Pack Signed Saturated) - byte<-word - word<-dword
PACKUS (Pack Unsigned Saturated) - byte<-word
PUNPCKH (Unpack High Data) - byte->word - word->dword - dword->qword
PUNPCKL (Unpack Low Data) - byte->word - word->dword - dword->qword
This whole arrow thing might seem confusing, but it's the same as it is for the PMADDWD instruction: the suffix names the input and output datatypes. Look at PACKSS and PUNPCKL, it means that the instructions are:
PACKSSWB, PACKSSDW
PUNPCKLBW, PUNPCKLWD, PUNPCKLDQ
The packing instructions take larger data elements and convert them to smaller data elements(eg word<--dword). The unpacking instructions take smaller data elements and convert them to larger data elements.
PACKUSWB: First off a saturation check is performed on the data elements in the source and destination operands. If a word is negative it becomes 0, and if a word is greater than 255 (the maximum value of a byte) it is clipped to 255. Now in the source and destination operands you have 16 bit words with values between 0 and 255. All it does now is collect these 8 values as bytes and put them into the destination operand. First the 4 words of the destination are put into the lower 4 bytes of the destination operand, and then the 4 words of the source are put into the upper 4 bytes of the destination operand. Here is how:
MM0 - |008000|000005|000230|001045| 4 words (64 bits)
MM1 - |-00005|024525|002345|000112| 4 words (64 bits)
PACKUSWB MM0,MM1 (Pack Unsigned Saturated Words to Bytes)
result of saturation:
(done in processor so MM1 doesn't actually change)
MM0 - |000255|000005|000230|000255| 4 words (64 bits)
MM1 - |000000|000255|000255|000112| 4 words (64 bits)
final result:
MM0 - |000|255|255|112|255|005|230|255| 8 bytes (64 bits)
PACKSSx does the same as PACKUSWB except that because it's signed it saturates if the number is bigger than 127 or less than -128. It can also convert from dword to word (PACKSSDW), in which case the limits are -32768..32767.
PUNPCKHx only works on the higher 32 bits of the destination and source operands. It takes data elements from each and interleaves them into the destination operand, with the destination's element in the low half of each pair. Here is an example (in hex for convenience):
MM0 |AF|45|0E|8A|12|67|FF|00| 8 bytes (64 bits)
MM1 |11|91|AB|5C|93|B8|0F|09| 8 bytes (64 bits)
PUNPCKHBW MM0,MM1 (Unpack High Data from Bytes to Words)
result:
MM0 |11AF|9145|AB0E|5C8A| 4 words (64 bits)
PUNPCKLx does the same except that it takes data from the lower 32 bits of the source and destination operand. E.g.:
MM0 |AF|45|0E|8A|12|67|FF|00| 8 bytes (64 bits)
MM1 |11|91|AB|5C|93|B8|0F|09| 8 bytes (64 bits)
PUNPCKLBW MM0,MM1 (Unpack Low Data from Bytes to Words)
result:
MM0 |9312|B867|0FFF|0900| 4 words (64 bits)
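Packing and unpacking are the hardest of the lot to picture, so here is a scalar C sketch of PUNPCKLBW, PUNPCKHBW and PACKUSWB. The function names mirror the mnemonics, sat_u8 is a hypothetical helper, and lane 0 is assumed to be the least significant element.

```c
#include <stdint.h>

/* PUNPCKLBW model: interleave the low 4 bytes of dest and src.        */
/* Each output word has the dest byte in its low half, src in its high. */
static uint64_t punpcklbw(uint64_t dest, uint64_t src) {
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t d = (dest >> (8 * i)) & 0xFF;
        uint64_t s = (src  >> (8 * i)) & 0xFF;
        r |= (d | (s << 8)) << (16 * i);
    }
    return r;
}

/* PUNPCKHBW model: the same interleave, applied to the high 4 bytes. */
static uint64_t punpckhbw(uint64_t dest, uint64_t src) {
    return punpcklbw(dest >> 32, src >> 32);
}

/* Clamp one 16-bit lane, read as a signed word, to the range 0..255. */
static uint64_t sat_u8(uint64_t w16) {
    if (w16 & 0x8000) return 0;          /* negative as a signed word */
    return w16 > 255 ? 255 : w16;
}

/* PACKUSWB model: dest words fill the low 4 bytes, src words the high 4 bytes. */
static uint64_t packuswb(uint64_t dest, uint64_t src) {
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        r |= sat_u8((dest >> (16 * i)) & 0xFFFF) << (8 * i);
        r |= sat_u8((src  >> (16 * i)) & 0xFFFF) << (8 * i + 32);
    }
    return r;
}
```

Unpacking against a zeroed register turns 4 bytes into 4 zero-extended words, and packing against zero afterwards turns them back, which is the round trip most pixel routines rely on.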
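One might wonder why on earth you need such WEIRD instructions? Well let me give you an example. Let's say we want to average 2 ARGB pixels from different memory locations. If we put the 2 pixels into different MMX registers and tried to add them as packed bytes we would either get overflows (without saturation) or clipped values (with saturation), and all sorts of weird things happening. Instead we unpack each byte into a word so that the sums have room to grow, do the maths on words, and pack the result back into bytes. Look:
pxor MM7,MM7 ;MM7=0, used as the zero source for unpacking
movd MM0,[edi] ;load pixel 1
movd MM1,[esi] ;load pixel 2
punpcklbw MM0,MM7 ;unpack the 4 bytes of pixel 1 into 4 words
punpcklbw MM1,MM7 ;unpack the 4 bytes of pixel 2 into 4 words
paddusw MM0,MM1 ;add each channel (max 510, fits in a word)
psrlw MM0,1 ;/2
packuswb MM0,MM7 ;pack the 4 words back into 4 bytes
movd [esi],MM0 ;store the averaged pixel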
Little Tips
These are little tips/tricks, a few I've made myself and most I've collected:
Making an MMX register=0
PXOR MM0, MM0
Filling all 64 bits of a MMX register with 1s.
PCMPEQB MM1, MM1
Compute the absolute difference of 2 unsigned numbers.
(Assuming packed-byte or packed-words.)
Input:
MM0: source operand
MM1: source operand
Output:
MM0: The absolute difference of the unsigned operands
MOVQ MM2, MM0 ; make a copy of MM0
PSUBUSB MM0, MM1 ; compute difference one way
PSUBUSB MM1, MM2 ; compute difference the other way
POR MM0, MM1 ; OR them together
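The reason the absolute difference trick works is that unsigned saturated subtraction clamps at zero, so of the two differences (a-b and b-a) at least one is always zero in every lane. A scalar C sketch for the packed-byte case, with function names of my own choosing:

```c
#include <stdint.h>

/* PSUBUSB model: per-byte subtract, clamped at zero. */
static uint64_t psubusb(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t x = (a >> (8 * i)) & 0xFF;
        uint64_t y = (b >> (8 * i)) & 0xFF;
        if (x > y) r |= (x - y) << (8 * i);
    }
    return r;
}

/* Absolute difference: one of the two saturated differences is zero */
/* in each lane, so OR-ing them keeps the non-zero one.               */
static uint64_t absdiff_u8(uint64_t a, uint64_t b) {
    return psubusb(a, b) | psubusb(b, a);
}
```

The same idea carries over to packed words by swapping in the word version of the saturated subtract.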
Some Implementation Ideas - Vector Rotation
LSD/Meltdown actually gave me this idea after implementing it in his 3D engine. Whether you're using a 12 or 9 multiplication rotation formula, you can execute those multiplications in parallel - making it a lot faster.
Some Implementation Ideas - ARGB pixels
The Alpha-Red-Green-Blue pixel format is perfect for fast manipulation with MMX. Doing a 64 bit read you can load 2 of these pixels into each register. From there on you are free to use MMX's parallelism to the max. You can now process 2 32 bit pixels (8 colour channels) per instruction - adding them, subtracting them - and all with or without automatic saturation. Multiplying needs the channels unpacked to words first, since the multiply instructions only work on words.
Here is an example of a 320x200x32bpp loop which additively copies a buffer onto another buffer, and saturates the RGB at 255:
;ASM 32bpp MMX adding
mov edi,[dest]
mov esi,[src]
mov ecx,32000
@MMX_layeraddloop:
movq MM0,[edi] ;Move QUAD(64 bits)
movq MM1,[esi] ;Move QUAD(64 bits)
paddusb MM0,MM1 ;Saturated Add
movq [edi],MM0 ;Move QUAD(64 bits) back to dest
add esi,8
add edi,8
dec ecx
jnz @MMX_layeraddloop
EMMS ;Must always do this after a bout of
;MMX instructions
It's fast.
Some Implementation Ideas - Byte Shifting
One of the weird things with MMX shifting is that it doesn't do shifting for the byte data element. This would be very handy for things like ARGB pixel manipulation. What this forces you to do is load the pixels, unpack them to words, manipulate them, then repack them. Very long process. I've got a method which is a bit of a hack (as usual), but is faster.
Look at this simple example below of shifting:
source data: |pppaaa|hhhrrr| (1 word)
word shift 2: |00pppa|aahhhr| (1 word)
byte shift 2: |00pppa|00hhhr| (1 word)
You can see that the only difference between word and byte shifting is that there are zeros in the byte shift where the overflows occur from the word shift. This can easily be eliminated by masking those bits off. So depending on the amount we shift by, and the direction of the shift, a different mask will have to be used:
shr1mask = 0111111101111111011111110111111101111111011111110111111101111111b
shr2mask = 0011111100111111001111110011111100111111001111110011111100111111b
shr3mask = 0001111100011111000111110001111100011111000111110001111100011111b
shl1mask = 1111111011111110111111101111111011111110111111101111111011111110b
shl2mask = 1111110011111100111111001111110011111100111111001111110011111100b
shl3mask = 1111100011111000111110001111100011111000111110001111100011111000b
In most tight loops the shift amount and direction are fixed, so you can use this method; where the shift isn't constant it won't be so good. Here is an example of loading two 32 bit pixels and byte shifting them using this method:
MOVQ MM7,[shr3mask] ;loads 64 bit mask
MOVQ MM0,[edi] ;loads 2 32 bit pixels
PSRLW MM0,3 ;shifts word elements 3 to the right.
PAND MM0,MM7 ;mask off irrelevant bits
MOVQ [edi],MM0 ;put modified pixels back.
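You can convince yourself that shift-then-mask really equals a per-byte shift with a few lines of C. In this sketch (names are my own) a plain 64 bit shift stands in for the wide MMX shift; bits still leak across byte boundaries, and the mask removes them in exactly the same way. The mask is built by replicating one byte pattern, which gives the same constants as the shr masks listed above.

```c
#include <stdint.h>

/* Shift every byte of v right by n (0..7) using one wide shift plus a mask. */
static uint64_t shr_bytes(uint64_t v, unsigned n) {
    uint64_t mask = 0x0101010101010101ULL * (0xFFu >> n);   /* n=3 gives 0x1F in every byte */
    return (v >> n) & mask;
}

/* Reference version: shift each byte on its own, for comparison. */
static uint64_t shr_bytes_ref(uint64_t v, unsigned n) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        r |= (((v >> (8 * i)) & 0xFF) >> n) << (8 * i);
    return r;
}
```

A left shift works the same way with the complementary masks (low bits cleared instead of high bits).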
Some Implementation Ideas - Crossfading
On the sademoscene mailing list Jacques posted an interesting challenge. He wanted to find a fast implementation of the alpha blend function - basically a crossfader. The function's formula is:
a=ARGB pixel1
b=ARGB pixel2
alpha=(0..1 value of the percentage of each image to blend)
finalpixel=[alpha*(a-b)]+b
You can see that if alpha==0 then 100% of image b will be shown and 0 percent of image a will be shown. Also if alpha==1, 100% of image a will be shown and 0 percent of image b will be shown. Now how to speed up this very useful algorithm using MMX? First of all, let's rewrite the algorithm to remove the floating point alpha value:
alpha=alpha*256; //scale it up by 256 and keep it as an integer.
so now:
finalpixel=b+([alpha*(a-b)]>>8);
There are 4 main parts to the formula:
1 : (a-b)
2 : *alpha
3 : >>8
4 : +b
Try to make your own implementation of it using MMX before looking how I did it. I looked for what I could use for each of the 4 steps:
1 : (a-b) : could use a PSUB routine for bytes.
2 : *alpha : would have to do 2 multiplications(one high and one low).
3 : >> 8 : shift
4 : +b : simply a PADD instruction for bytes.
Here is how I did it:
;assume alpha is a value 0..255
;assume edi points to buffer1
;assume esi points to buffer2
;assume edx points to destination buffer
pxor MM6,MM6 ;make MM6==0 (used for unpacking)
mov eax,[alpha]
mov ebx,eax
shl ebx,16
add eax,ebx ;eax=alpha in both words
movd MM7,eax
punpckldq MM7,MM7 ;MM7=alpha in all 4 words
;//////////inner LOOP/////////////
movd MM0,[esi] ;pixel a
movd MM1,[edi] ;pixel b
punpcklbw MM0,MM6 ;byte-->word(pixel a)
punpcklbw MM1,MM6 ;byte-->word(pixel b)
psubw MM0,MM1 ;a-b (can be negative)
pmullw MM0,MM7 ;(a-b)*alpha, low 16 bits
psllw MM1,8 ;b*256
paddw MM0,MM1 ;b*256+(a-b)*alpha, always 0..65280
psrlw MM0,8 ;/256
packuswb MM0,MM6 ;convert back into byte form
movd [edx],MM0 ;store the blended pixel
;//////////inner LOOP/////////////
It's weird, I know. :) If you know a better way - and I've no doubt there is one, please let me know.
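When playing with a blend like this it helps to have a plain C reference to compare against. The sketch below implements the integer formula per channel, rearranged as a*alpha + b*(256-alpha) so that no intermediate value goes negative; that is algebraically the same as b*256 + alpha*(a-b). The function names are hypothetical, and alpha is assumed to be in the range 0..256.

```c
#include <stdint.h>

/* Blend one 8-bit channel: alpha = 0 gives b, alpha = 256 gives a. */
static uint8_t blend_channel(uint8_t a, uint8_t b, unsigned alpha) {
    /* b*256 + alpha*(a-b), rearranged so every term is non-negative */
    unsigned sum = a * alpha + b * (256u - alpha);
    return (uint8_t)(sum >> 8);
}

/* Blend two ARGB pixels channel by channel. */
static uint32_t blend_argb(uint32_t pa, uint32_t pb, unsigned alpha) {
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint8_t ca = (uint8_t)(pa >> shift);
        uint8_t cb = (uint8_t)(pb >> shift);
        out |= (uint32_t)blend_channel(ca, cb, alpha) << shift;
    }
    return out;
}
```

The sum never exceeds 255*256 = 65280, so it fits in an unsigned 16 bit word, which is what makes the calculation a comfortable fit for packed-word MMX arithmetic.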
Some Implementation Ideas - Blurring
Here is my MMX blurring routine:
_MMX_blur_:
push edi
mov edi,[destaddr]
mov ecx,256000 ;320x200x4
sub ecx,2564
add edi,1284
pxor MM7,MM7 ;=0
movd MM0,[edi-4]
@blur_more:
movd MM1,[edi+4]
movd MM2,[edi-1280]
movd MM3,[edi+1280]
punpcklbw MM0,MM7
punpcklbw MM1,MM7
punpcklbw MM2,MM7
punpcklbw MM3,MM7
paddusw MM0,MM1
paddusw MM0,MM2
paddusw MM0,MM3
psrlw MM0,2
packuswb MM0,MM7
movd eax,MM0
stosd
sub ecx,4
jnz near @blur_more
EMMS
pop edi
ret
I think it can be optimised a lot, especially since it doesn't operate on pixels in parallel.
Some Implementation Ideas - Complex Multiplications
MMX can be VERY useful for doing complex multiplications - which is useful in things like fractals. Now I'm no expert on imaginary number planes or anything like that so this is straight out of an Intel document:
Let the input data be Dr and Di where
Dr = real component of the data
Di = imaginary component of the data
Format the constant complex coefficients in memory as four 16-bit
values [Cr -Ci Ci Cr]. Remember to load the values into the MMX technology
register using a MOVQ instruction.
Input:
MM0: a complex number Dr, Di
MM1: constant complex coefficient in the form [Cr -Ci Ci Cr]
Output:
MM0: two 32-bit dwords containing [ Pr Pi ]
The real component of the complex product is Pr = Dr*Cr - Di*Ci, and the imaginary component of the complex product is Pi = Dr*Ci + Di*Cr
PUNPCKLDQ MM0,MM0 ; This makes [Dr Di Dr Di]
PMADDWD MM0, MM1 ; and you're done, the result is
; [(Dr*Cr-Di*Ci)(Dr*Ci+Di*Cr)]
Note that the output is a pair of packed dwords. If needed, a pack instruction can be used to convert the result to 16-bit (thereby matching the format of the input).
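In scalar C the same sum-of-products layout looks like the sketch below. The type and function names are made up for the illustration, and it ignores the extreme corner cases around -32768 (where negating Ci or summing the products would not fit); the point is only to show which product lands in which output.

```c
#include <stdint.h>

typedef struct { int32_t re, im; } complex32;   /* hypothetical result type */

/* One PMADDWD-style step: each output is the sum of two 16x16 products.  */
/* Data lanes are [Dr Di Dr Di], coefficient lanes are [Cr -Ci Ci Cr].    */
/* Assumes the inputs stay clear of the -32768 corner cases.              */
static complex32 complex_mul(int16_t dr, int16_t di, int16_t cr, int16_t ci) {
    complex32 p;
    p.re = (int32_t)dr * cr + (int32_t)di * (-(int32_t)ci);   /* Dr*Cr - Di*Ci */
    p.im = (int32_t)dr * ci + (int32_t)di * cr;               /* Dr*Ci + Di*Cr */
    return p;
}
```

For example (1+2i)*(3+4i) comes out as -5+10i, matching the usual pencil-and-paper rule.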
Closing Words
Well I really hope that people start using MMX more - because it REALLY is very cool. If you'd like to comment to me about anything in this document, please don't hesitate! I think next time I'll look into 3DNow! which is AMD's new set of floating point instructions. Just to give you a taste - 3DNow!'s registers are also MM0-MM7, except that the 64 bits are divided into 2 single precision floating point numbers. What this means is that you can do floating point functions in parallel - much like MMX. There are also a few new, much needed MMX instructions included in AMD's 3DNow! and in the new PIII range, which are just an extension to the integer MMX instructions mentioned in this doc.
Greets: Demoscene, All the people at Optimise'99, Everyone at #programming, ColdBlood, Cyberphreak, Deadpoet, LSD, Maverick, Neuron, NiMH, Saurax, Viper, and everybody that I know!